Variable selection in model-based clustering: A general variable role modeling

Authors

Cathy Maugis

Gilles Celeux

Marie-Laure Martin-Magniette

Abstract

The variable selection procedures currently available for model-based clustering assume that the irrelevant clustering variables are either all independent of, or all linked with, the relevant clustering variables. We propose a more versatile variable selection model that describes three possible roles for each variable: relevant clustering variables, irrelevant clustering variables that depend on a subset of the relevant clustering variables, and irrelevant clustering variables that are totally independent of all the relevant variables. A model selection criterion and a variable selection algorithm are derived for this new variable role modeling. The model identifiability and the consistency of the variable selection criterion are also established. Numerical experiments highlight the interest of this new modeling.

Key-words: Relevant, redundant or independent variables; Variable selection; Model-based clustering; Gaussian mixtures; Linear regression; BIC

∗ Université Paris-Sud 11, Projet select
† INRIA Saclay Île-de-France, Projet select, Université Paris-Sud 11
‡ UMR AgroParisTech/INRA MIA 518, Paris
§ URGV UMR INRA 1165, CNRS 8114, UEVE, Evry

French title: Sélection de variables pour la classification non supervisée par mélanges gaussiens : une modélisation générale du rôle des variables (Variable selection in unsupervised classification with Gaussian mixtures: a general modeling of the role of variables)
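The abstract separates irrelevant variables that depend on relevant ones (redundant) from those that are fully independent, and names linear regression and BIC among its tools. The sketch below is only an illustration of that idea under strong simplifying assumptions, not the authors' criterion or algorithm: it takes a single candidate variable and a single relevant variable, and compares the BIC of a Gaussian linear regression against the BIC of an independent Gaussian. All function names and the toy data are hypothetical.

```python
import math


def gaussian_bic(residuals, n_params):
    """BIC = 2 * max log-likelihood - n_params * log(n); larger is better."""
    n = len(residuals)
    variance = sum(r * r for r in residuals) / n  # maximum-likelihood variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * variance) + 1)
    return 2 * log_lik - n_params * math.log(n)


def bic_independent(y):
    """Candidate modeled as a Gaussian unrelated to anything (mean, variance)."""
    mean = sum(y) / len(y)
    return gaussian_bic([v - mean for v in y], n_params=2)


def bic_regression(y, x):
    """Candidate regressed on one relevant variable (intercept, slope, variance)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return gaussian_bic(residuals, n_params=3)


def irrelevant_role(y, x):
    """Hypothetical helper: pick the role with the larger BIC."""
    if bic_regression(y, x) > bic_independent(y):
        return "redundant"
    return "independent"


# Toy data (an assumption for illustration): z is exactly uncorrelated with x,
# while y is a linear function of x plus a small perturbation.
x = [-1.0, 1.0, -1.0, 1.0] * 10
z = [-1.0, -1.0, 1.0, 1.0] * 10
y = [2.0 * a + 0.1 * b for a, b in zip(x, z)]

print(irrelevant_role(y, x))
print(irrelevant_role(z, x))
```

On this toy data, y is classed as redundant and z as independent: regressing z on x does not improve the fit, so the extra slope parameter only costs a log(n) penalty. The model in the paper additionally decides which variables are relevant for clustering and which subset of relevant variables explains each redundant one, which this sketch does not attempt.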




Journal title:

Computational Statistics & Data Analysis

Volume 53

Publication date: 2009
